Introduction

In September 2020, Twitter released a new API endpoint and modified the payload of the old twitter objects. The new payload now includes several useful fields that were not previously available (e.g., like counts and impression data). Creating Python scripts to interact with the new API is useful to enable future studies that use the new fields.

The following repo provides Python modules to query the new Twitter API, instructions on how to create a locally-hosted MySQL database, modules to translate the data to SQL, and modules to create interactive network graphs. Storing data in MySQL provides the requisite structure to efficiently and effortlessly perform network analysis of mention, hashtag, or URLs networks.

## Data Collection
The analytic workflow starts with collecting data the V2 Twitter API by using functions found in the api_wrapper module. In order for user to access the Twitter API, users must first request a bearer token from the Twitter developer portal.
In this example, we query a set of Twitter accounts that were previously identified to post pro or anti-vaccination content by calling the get_users function. Seed user names are provided in the seed_accounts.csv file in the data folder on this GitHub repository.
Since, the Twitter API rate limit for user look up queries is 900 accounts per 15 minute, the DataCollection Notebook incorporates timing delays between requests to prevent time outs. The workflow also incorporates error handling for empty responses that result when querying deleted or suspended accounts.
centered image
Next, we query seed accounts for their last 200 posts by calling the get_user_activity function. Since useful information (e.g., public metrics) is nested within the user activity json objects, the get_user_activity function extracts and appends data into additional columns where possible.
centered image
Likewise, we call the extract_el function to parse mentions, hashtags, and URLs into a directed edgelist. Source nodes in this edge list record the author_id and author_screen_name of users who mention entities (i.e., users, hashtags, or URLs) in a post. Since users can mention several entities in individual posts, the activity data has a one-to-many relationship with the edge list. In this example, 783 seed users mentioned 49,022 users.
Although we did not peform additional data collection beyond what was mentioned, we could easily incorporate additional steps if needed. For example, we could collect user profile information for the 49,022 mentioned users. We could incorporate information from additional sources (e.g., query the Botometer API for bot likelihood scores). We could expand our analysis to a 2-hop network by querying mentioned users for account activity.
To make the api_wrapper module more effective, we need to improve error handling, identify optimal sleep times for deailing with rate limits, transition the API functions to a class object, improve documentation of available Twitter object fields, and write API response to json files as opposed to csv files.

Creating a MySQL Database

In retrospect, we would have preferred to create a cloud-hosted MySQL database using Amazon’s Relational Database Service (RDS). A cloud-hosted database would allow all team members to interact with a single database remotely. Instead, we created Docker container to locally host a MySQL database. We did this for two reasons: 1) it doesn’t cost anything to locally host a database, and 2) we were not familiar with how to set up a RDS database.
To creat the Docker container, we created an empty directory and added the following Docker .yml file to it:

centered image

Then, we navigated to the directory with the .yml file and executed the docker compose up command.

## Translating Data to MySQL
The DataETL_ToMySQL notebook provides an overview of the steps we used to create SQL table schemas and write .csv data files to our database. The entity relationship diagram of our database is shown below:
centered image
Below is an example of an ETL workflow for the user table.First, we create a SQL table where we specify the data types for each of the fields.
centered image
Next, we load the data we collected in the previous step and transform fields where appropriate (e.g., converting dates to date objects). In this example, we also had to replace apostrophes found within text fields with spaces because including them caused errors in the write step.
centered image
Last, we write the data frame, row-by-row, to the SQL database.
centered image

Data Analysis and Visualization

Now that the data is stored in a SQL database, we can query the data using pymysql package and network_vis_helper module. A simple example is provided in the DataAnalysisVisualization notebook to demonstrate how users can interact with the database and highlight the benefit of writing more effective queries (i.e., not using a SELECT * approach).

For example, if we want to view the top-10 most liked tweets, it takes over 5 seconds to load the entire statuses table into memory and use Python to filter the data to the top 10-records.

centered image

In contrast, if we answer this question using only SQL commands, it executes in less than a second (~0.12 seconds)

centered image

Another advantage is that we can leverage the structure of the edgelist table to generate network visualizations much easier. For example, we can quickly generate a graph of the most popular hash tags as follows:

centered image

A static image of the graph is provided below; however, the html-version of this graph allows a user to explore the graph interactively (e.g., zoom, reposition nodes, etc.). An interactive version of the graph can be viewed at the following link.

centered image